Abstract



When gene copies are sampled from various species, the resulting gene tree might 
disagree with the containing species tree. The primary causes of gene tree and species 
tree discord include lineage sorting, horizontal gene transfer, and gene duplication and 
loss. Each of these events yields a different parsimony criterion for inferring the
(containing) species tree from gene trees. With lineage sorting, species tree inference
seeks the tree that minimizes the extra gene lineages that had to coexist along species
lineages; with gene duplication, it seeks the tree that minimizes gene duplications and/or
losses. In this paper, we show the following results:

(i) The deep coalescence cost is equal to the number of gene losses minus two times 
the gene duplication cost in the reconciliation of a uniquely leaf labeled gene tree and a 
species tree. The deep coalescence cost can be computed in linear time for any arbitrary 
gene tree and species tree. 

(ii) The deep coalescence cost is always no less than the gene duplication cost in the 
reconciliation of an arbitrary gene tree and a species tree. 

(iii) Species tree inference by minimizing deep coalescences is NP-hard. 

Index terms: Reconciliation of gene tree and species, deep coalescence, gene duplication 
and loss, parsimony criterion, NP-hardness. 

1 Introduction 

Gene trees are fundamental to molecular systematics. Traditionally, a gene tree is
reconstructed from DNA sequence variation at individual genetic loci in a group of species and
is taken as the phylogenetic tree of the species due to sequencing technology limitations.
However, when gene copies are sampled from various species, the resulting gene tree might
disagree with the species tree. As such, the relationship between gene trees and species trees
has been the focus of many studies (see for example [5], [12], ISHl, [25], [27], [31], [33]). It has
long been recognized that gene trees can be used to estimate species divergence times, ancestral
population sizes and even the containing species tree, although they may not accurately reflect
the species tree [7], [15], [21].

The discord of gene trees and the containing species tree can arise from horizontal gene 
transfer, lineage sorting, and gene duplication and loss. The importance of these causes 
depends on the considered genes and species. Hence, inferring the species tree from gene trees 
has been investigated under various parsimony criteria. With lineage sorting (also called deep
coalescence), the problem is to find the tree minimizing the extra gene lineages that had to
coexist along species lineages [20]; with gene duplication, the problem is to find the tree
minimizing gene duplications and/or losses [12], [25], [13], [28].

Inferring the species tree from a set of gene trees has often been studied under the gene 
duplication cost [U E] [3] [6] [8] [14] HE] HE] [30] [34] until very recently. In a seminal work 
[20], Maddison addressed lineage sorting in the framework of coalescence theory. Coalescence
theory is an active branch of population genetics concerned with tracing the genealogical 
history of a present-day gene copy. For a gene sampled from two individuals, one may ask: How
deep in time do these two lineages coalesce? Hence, the depth of this coalescence is a measure
of the relationship between two sampled gene copies. The deeper in time the coalescence
occurs, the more distantly related they are. Maddison proposed using the total number of
"extra" gene lineages that fail to coalesce on a species tree to measure the inconsistency
between a gene tree and a species tree, called the deep coalescence cost. For the gene tree and
species tree shown in Figure 1, there are three gene lineages on one branch and two gene
lineages on another branch that fail to coalesce, that is, two extra lineages on the first branch
and one on the second, giving a deep coalescence cost of 3. Since coalescence theory
provides the probability that a gene tree would exist in a species tree, it allows the inference 
problem to be studied in an explicit statistical framework jU [29]. This seems to give the deep
coalescence model an advantage over the other models. 

This paper is a sequel to [19], which studies the complexity and algorithmic issues of
inferring the species tree from a set of gene trees with the gene duplication/loss cost. In this
work, we present a relationship among the deep coalescence cost, the duplication cost, and the
number of gene losses. Although deep coalescence and gene duplication are two different
mechanisms responsible for the discord of gene trees and species trees, this relationship
suggests that the deep coalescence cost and the duplication cost are closely related to each
other as similarity measures of trees. We further show that inferring the species tree from gene
trees by minimizing the deep coalescence cost is also NP-hard.

Figure 1: (i) A gene tree, (ii) A species tree, (iii) The reconciliation of the gene tree in (i) 
into the species tree in (ii) has deep coalescence cost 3. 
2 Basic definitions and notations



In this section, we shall introduce basic definitions and notations on gene duplication, gene 
loss and deep coalescence that are used in the following sections. 

2.1 Species trees and gene trees 

For a set of $n$ taxa, their evolutionary history is modeled as a rooted, full binary tree with
$n$ leaves, in which each leaf is labeled with the taxon it represents and internal nodes are
unlabeled. Here, 'fullness' means that each internal node has exactly two children. Such a tree
is called a species tree. In a species tree, each unlabeled internal node is considered as a
taxon family that includes as its members the subordinate species represented by the leaves
below it. Thus, the evolutionary relation "$m$ is a descendant of $n$" is expressed using the
set-theoretic notation "$m \subset n$". We also call an internal node an ancestor of the species
below it.

The model for gene relationship is also a rooted, full binary tree with leaves representing 
genes, called a gene tree. Usually, a gene tree is reconstructed from a collection of gene family 
members sampled from the considered species. We label the gene copies by the species from 
which they are sampled. Thus, leaf labels may not be unique in a gene tree as two or more 
gene copies might be found in a species. An internal node g corresponds to a multiset of leaf 
labels. 

Finally, for a species or gene tree $T$, we use $L(T)$ to denote its set of leaf labels. For
an internal node $t$ in $T$, $a(t)$ and $b(t)$ denote its two children.

2.2 Gene duplication 

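Let $G$ be a gene tree and $S$ a species tree such that $L(G) \subseteq L(S)$. For any nodes
$s', s''$ in $S$, the least common ancestor of $s'$ and $s''$ is defined to be the smallest node
$s$ in $S$ such that $s', s'' \subseteq s$, which is denoted by $\mathrm{lca}(s', s'')$. To
reconcile the gene tree $G$ and the species tree $S$, each node $g$ of $G$ is mapped to a unique
node $M(g)$ in $S$ as

\[
M(g) =
\begin{cases}
\text{the leaf of } S \text{ carrying the label of } g, & \text{if } g \text{ is a leaf}, \\
\mathrm{lca}(M(a(g)), M(b(g))), & \text{otherwise}.
\end{cases}
\]

This mapping $M$ was first considered in [12] and then formulated in [25]. We call $M$ the lca
mapping or reconciliation of $G$ in $S$. Obviously, if $g' \subset g$, then $M(g') \subseteq M(g)$.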

Definition 2.1 Let g be an internal node of G. If M(c(g)) = M(g) for some child c(g) of g, 
then we say that a duplication occurs at M(g) (or more exactly in the lineage entering M(g)) 
in S. 

The total number of duplications arising in the lca reconciliation of $G$ in $S$ is proposed to
measure the discord of the gene tree and species tree and is called the duplication cost. We
use $c_{dup}(G, S)$ to denote the duplication cost for $G$ and $S$. Note that the duplication
cost is not symmetric.
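The lca mapping and the duplication cost can be illustrated with a minimal sketch. The representation is an assumption made here for illustration only: a tree is a nested 2-tuple whose leaves are label strings, and a species-tree node is identified with its cluster (the set of leaf labels below it). The function names are hypothetical, and the quadratic-time lca search is chosen for brevity, not efficiency.

```python
def species_clusters(S):
    """List the cluster (set of leaf labels) of every node of S, root last."""
    if isinstance(S, str):                      # a leaf is just its label
        return [frozenset([S])]
    left, right = species_clusters(S[0]), species_clusters(S[1])
    return left + right + [left[-1] | right[-1]]


def lca(labels, clusters):
    """Smallest node (cluster) of the species tree that contains `labels`."""
    return min((c for c in clusters if labels <= c), key=len)


def duplication_cost(G, S):
    """Count internal nodes g of G with M(c(g)) == M(g) for some child c(g)."""
    clusters = species_clusters(S)
    duplications = 0

    def image(g):
        nonlocal duplications
        if isinstance(g, str):                  # M(leaf) = leaf of S with that label
            return frozenset([g])
        a, b = image(g[0]), image(g[1])
        m = lca(a | b, clusters)                # M(g) = lca(M(a(g)), M(b(g)))
        if m in (a, b):
            duplications += 1
        return m

    image(G)
    return duplications


S = (("a", "b"), ("c", "d"))
G = (("a", "c"), ("b", "d"))
print(duplication_cost(G, S))   # 1: only the root of G is a duplication node
print(duplication_cost(S, S))   # 0
```

In the example, both children of the root of G are mapped to the root of S, so the root of G is the only duplication node.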
2.3 Gene loss



A subset $A$ of (internal and/or leaf) nodes of a species tree $S$ is incompatible if
$x \cap y = \emptyset$ for any distinct $x, y \in A$. For an incompatible subset $A$ in $S$, the
restriction of $S$ on $A$ is the smallest subtree of $S$ containing $A$ as its leaf set, denoted
by $R_S(A)$. It is easy to see that the root of $R_S(A)$ is the least common ancestor of the
nodes from $A$. The homomorphic subtree $S|_A$ of $S$ induced by $A$ is the tree obtained from
$R_S(A)$ by contracting all degree-2 nodes except for the root of $R_S(A)$.
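As an illustration, the homomorphic subtree can be computed by a single recursive pass. The sketch below assumes trees are nested 2-tuples with label strings as leaves, and it handles only the case where $A$ is a set of leaves (the case used later for $L(G)$); the name `restrict` is hypothetical.

```python
def restrict(S, keep):
    """Homomorphic subtree of S induced by the leaf labels in `keep`.

    Returns None when no leaf of S is kept.
    """
    if isinstance(S, str):
        return S if S in keep else None
    kids = [t for t in (restrict(S[0], keep), restrict(S[1], keep)) if t is not None]
    if not kids:
        return None
    # A node left with a single child has degree 2 and is contracted away.
    return kids[0] if len(kids) == 1 else (kids[0], kids[1])


S = ((("a", "b"), "c"), "d")
print(restrict(S, {"a", "c"}))        # ('a', 'c')
print(restrict(S, {"a", "b", "d"}))   # (('a', 'b'), 'd')
```

The root of the returned tree corresponds to the least common ancestor of the kept leaves, because any ancestor with only one surviving child is contracted.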

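Let $G$ be a gene tree such that $L(G) \subseteq L(S)$. Then $S|_{L(G)}$ is well defined. To
reconcile $G$ and $S$ in this general case, we consider the lca mapping $M$ from $G$ to
$S|_{L(G)}$. For any two nodes $s$ and $s'$ of $S|_{L(G)}$ such that $s \subseteq s'$, we define

\[
d(s, s') = \left| \{ h \in S|_{L(G)} : s \subset h \subset s' \} \right|,
\]

where both inclusions are proper. That is, $d(s, s')$ is the number of nodes on the path from
$s'$ to $s$, excluding $s$ and $s'$ themselves.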

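Recall that $a(g)$ and $b(g)$ denote the children of $g$. The number of losses $l_g$ associated
to $g$ is defined as

\[
l_g =
\begin{cases}
0, & \text{if } M(a(g)) = M(g) = M(b(g)), \\
d(M(a(g)), M(g)) + 1, & \text{if } M(a(g)) \subset M(g) = M(b(g)), \\
d(M(a(g)), M(g)) + d(M(b(g)), M(g)), & \text{if } M(a(g)) \subset M(g) \text{ and } M(b(g)) \subset M(g),
\end{cases}
\]

where the inclusions are proper and the roles of $a(g)$ and $b(g)$ are exchanged in the second
case when $M(b(g)) \subset M(g) = M(a(g))$. This definition of $l_g$ is a generalization of the
loss cost given in [13]. When $L(G) = L(S)$, our definition is identical to the one given in [13].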

The gene loss cost in the reconciliation of $G$ in $S$ is defined as the total number of losses
$\sum_{g \in G} l_g$. We denote this gene loss cost for $G$ and $S$ by $c_{loss}(G, S)$.
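The loss cost can be sketched in the same style. The code below assumes $L(G) = L(S)$ (so no restriction of $S$ is needed), trees given as nested 2-tuples with label strings as leaves, and the three-case rule for $l_g$ that is consistent with the proof of Theorem 3.1: 0 when both children share the image of $g$, $d + 1$ when exactly one child is mapped strictly below, and the sum of the two $d$ values when both are. All function names are hypothetical.

```python
def species_clusters(S):
    """List the cluster (set of leaf labels) of every node of S, root last."""
    if isinstance(S, str):
        return [frozenset([S])]
    left, right = species_clusters(S[0]), species_clusters(S[1])
    return left + right + [left[-1] | right[-1]]


def d(s, t, clusters):
    """Number of species-tree nodes strictly between s and its ancestor t."""
    return sum(1 for h in clusters if s < h < t)


def loss_cost(G, S):
    """Total number of losses, summing l_g over the internal nodes g of G."""
    clusters = species_clusters(S)
    total = 0

    def image(g):
        nonlocal total
        if isinstance(g, str):
            return frozenset([g])
        a, b = image(g[0]), image(g[1])
        m = min((c for c in clusters if (a | b) <= c), key=len)   # M(g)
        below = [c for c in (a, b) if c != m]    # children mapped strictly below M(g)
        if len(below) == 2:                      # both children below M(g)
            total += d(a, m, clusters) + d(b, m, clusters)
        elif len(below) == 1:                    # exactly one child below M(g)
            total += d(below[0], m, clusters) + 1
        return m                                 # no child below: l_g = 0

    image(G)
    return total


S = (("a", "b"), ("c", "d"))
G = (("a", "c"), ("b", "d"))
print(loss_cost(G, S))   # 4
print(loss_cost(S, S))   # 0
```

In the example, each of the two non-root internal nodes of G contributes two losses and the root contributes none.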

2.4 Deep coalescence 

Let $G$ be a gene tree and $S$ a species tree such that $L(G) = L(S)$. Under the lca mapping
$M : G \to S$, if a branch $e$ of $S$ is on the $k$ paths from $M(g_i)$ to $M(c(g_i))$, where
$g_i \in G$ and $c(g_i)$ is a child of $g_i$ ($1 \le i \le k$), then we say that there are
$k - 1$ 'extra' lineages on $e$ failing to coalesce on $e$. The deep coalescence (DC) cost is
defined as the total number of the 'extra' lineages on all branches of $S$ in the reconciliation
$M$ of $G$ in $S$ (see [20]), which is denoted by $c_{dc}(G, S)$. Note that the concept of deep
coalescence is meaningful only if $S$ has 2 or more leaves. We assume this throughout the paper.
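The definition translates directly into a path-counting procedure. The following is a minimal sketch, assuming $L(G) = L(S)$ and trees given as nested 2-tuples with label strings as leaves; each branch of $S$ is identified with the node it enters, and the function names are hypothetical.

```python
from collections import Counter


def species_clusters(S):
    """List the cluster (set of leaf labels) of every node of S, root last."""
    if isinstance(S, str):
        return [frozenset([S])]
    left, right = species_clusters(S[0]), species_clusters(S[1])
    return left + right + [left[-1] | right[-1]]


def deep_coalescence_cost(G, S):
    """Sum of (k - 1) over branches of S lying on k image paths M(g) -> M(child)."""
    clusters = species_clusters(S)
    on_branch = Counter()          # branch (keyed by the node it enters) -> number of paths

    def image(g):
        if isinstance(g, str):
            return frozenset([g])
        a, b = image(g[0]), image(g[1])
        m = min((c for c in clusters if (a | b) <= c), key=len)   # M(g)
        for child in (a, b):
            for h in clusters:
                if child <= h < m:          # branch entering h lies on the path M(g) -> M(child)
                    on_branch[h] += 1
        return m

    image(G)
    return sum(k - 1 for k in on_branch.values())


S = (("a", "b"), ("c", "d"))
G = (("a", "c"), ("b", "d"))
print(deep_coalescence_cost(G, S))   # 2: one extra lineage on each branch leaving the root
print(deep_coalescence_cost(S, S))   # 0
```

Only branches lying on at least one path appear in the counter, so every summand `k - 1` is non-negative.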

In general, if $L(G) \subset L(S)$, the deep coalescence cost $c_{dc}(G, S)$ is defined as
$c_{dc}(G, S|_{L(G)})$, where $S|_{L(G)}$ is the homomorphic subtree of $S$ induced by $L(G)$.
Such a generalization will be used in the study of inferring the species tree from a set of
gene trees.

3 An equation of the duplication and DC costs

We have seen that deep coalescences, gene losses and duplications are inferred through the
gene tree/species tree reconciliation. In fact, they are closely related through a simple
equation.

Definition 3.1 Let $G$ be a gene tree and $S$ a species tree such that $L(G) \subseteq L(S)$.
Under the lca mapping $M : G \to S$, an internal node $g \in G$ is of

• type-1 if $M(g') \subset M(g)$ (properly) for each child $g'$ of $g$;

• type-2 if there exists a unique child $g'$ such that $M(g') = M(g)$;

• type-3 if $M(g') = M(g)$ for each child $g'$ of $g$.

Note that type-2 and type-3 internal nodes correspond one-to-one with duplication events.

Theorem 3.1 Let $G$ be a uniquely leaf-labeled gene tree and $S$ a species tree such that
$L(G) = L(S)$. Then,

\[
c_{dc}(G, S) = c_{loss}(G, S) - 2c_{dup}(G, S).
\]

Proof. Let $G$ and $S$ have $n$ leaves. Assume that there are $k_1$ type-1 internal nodes
$g_{11}, g_{12}, \ldots, g_{1k_1}$, $k_2$ type-2 internal nodes $g_{21}, g_{22}, \ldots, g_{2k_2}$,
and $k_3$ type-3 internal nodes $g_{31}, g_{32}, \ldots, g_{3k_3}$ in $G$ under the lca mapping
$M : G \to S$, respectively. Since $G$ is a full binary tree with $n$ leaves, $G$ has $n - 1$
internal nodes and hence

\[
k_1 + k_2 + k_3 = n - 1. \tag{1}
\]

Additionally, since type-2 and type-3 nodes correspond one-to-one with duplication events,

\[
c_{dup}(G, S) = k_2 + k_3. \tag{2}
\]

For simplicity, we assume that $g'$ and $g''$ are the children of $g$ for each type-1 internal
node $g$; we also assume that $a(g)$ is the unique child such that $M(a(g)) \subset M(g)$ for
each type-2 node $g$. Since we use $d(M(h), M(g))$ to denote the number of nodes on the path
from $M(g)$ to $M(h)$ for a node $g$ and its child $h$, the number of lineages contained in the
path is $d(M(h), M(g)) + 1$. Every branch of $S$ lies on at least one such path, so the DC cost
is the total number of lineages on all paths minus $|E(S)|$. Therefore, by Eqn. (1) and (2) and
the fact that $|E(S)| = 2n - 2$,

\begin{align*}
c_{dc}(G, S) &= \sum_{j=1}^{k_1} \left\{ \left[ d(M(g'_{1j}), M(g_{1j})) + 1 \right] + \left[ d(M(g''_{1j}), M(g_{1j})) + 1 \right] \right\} \\
&\quad + \sum_{j=1}^{k_2} \left[ d(M(a(g_{2j})), M(g_{2j})) + 1 \right] - |E(S)| \\
&= c_{loss}(G, S) + 2k_1 - (2n - 2) \\
&= c_{loss}(G, S) - 2(k_2 + k_3) \\
&= c_{loss}(G, S) - 2c_{dup}(G, S).
\end{align*}

This concludes the proof. □
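The identity of Theorem 3.1 can be checked numerically on small cases. The sketch below is a sanity check, not part of the proof: it enumerates all 15 rooted full binary trees on four labeled leaves and tests the equation on every (gene tree, species tree) pair. Trees are assumed to be nested 2-tuples with label strings as leaves, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def species_clusters(S):
    """List the cluster (set of leaf labels) of every node of S, root last."""
    if isinstance(S, str):
        return [frozenset([S])]
    left, right = species_clusters(S[0]), species_clusters(S[1])
    return left + right + [left[-1] | right[-1]]


def costs(G, S):
    """Return (c_dc, c_dup, c_loss) for a gene tree G and species tree S, L(G) = L(S)."""
    clusters = species_clusters(S)
    on_branch, dup, loss = Counter(), 0, 0

    def image(g):
        nonlocal dup, loss
        if isinstance(g, str):
            return frozenset([g])
        a, b = image(g[0]), image(g[1])
        m = min((c for c in clusters if (a | b) <= c), key=len)   # M(g)
        below = [c for c in (a, b) if c != m]    # children mapped strictly below M(g)
        if len(below) < 2:                       # type-2 or type-3 node
            dup += 1
        for c in below:
            between = [h for h in clusters if c < h < m]
            loss += len(between) + (1 if len(below) == 1 else 0)
            for h in [c] + between:              # branches on the path M(g) -> M(child)
                on_branch[h] += 1
        return m

    image(G)
    return sum(k - 1 for k in on_branch.values()), dup, loss


def all_trees(labels):
    """All rooted full binary trees on a tuple of distinct labels."""
    if len(labels) == 1:
        return [labels[0]]
    first, rest = labels[0], labels[1:]
    trees = []
    for k in range(len(rest)):                   # `first` always goes to the left subtree
        for group in combinations(rest, k):
            right = tuple(x for x in rest if x not in group)
            for lt in all_trees((first,) + group):
                for rt in all_trees(right):
                    trees.append((lt, rt))
    return trees


trees = all_trees(("a", "b", "c", "d"))
identity_holds = all(
    dc == loss - 2 * dup
    for G in trees for S in trees
    for dc, dup, loss in [costs(G, S)]
)
print(len(trees), identity_holds)   # 15 True
```

Placing the first label always in the left subtree avoids generating the same unordered tree twice.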

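Remarks. (1) Following the proof of the equation in the above theorem, one can easily see
that for an arbitrary gene tree $G$, in which two or more gene copies may come from the same
species, and a species tree $S$ such that $L(G) = L(S)$,

\[
c_{dc}(G, S) = c_{loss}(G, S) - 2c_{dup}(G, S) + 2\left[ (\text{no. of genes}) - (\text{no. of species}) \right],
\]

since a gene tree with $m$ leaves has $m - 1$ internal nodes, so that Eqn. (1) becomes
$k_1 + k_2 + k_3 = m - 1$ while $|E(S)| = 2n - 2$ is unchanged.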

(2) Since the number of gene duplications and losses can be calculated in linear time
[34], [19], the first remark implies that the deep coalescence cost can also be computed in
linear time.

By Theorem 3.1, $c_{dc}(G, S) \le c_{loss}(G, S)$ for a species tree $S$ and a uniquely
leaf-labeled gene tree $G$. Now we show that it is bounded below by the duplication cost for
an arbitrary gene tree.

Theorem 3.2 Let $G$ be a uniquely leaf-labeled gene tree and $S$ a species tree such that
$L(G) = L(S)$. Then, $c_{dc}(G, S) \ge c_{dup}(G, S)$.

Proof. Denote the image node set of the lca mapping $M$ by $M(G)$, which is a subset of nodes
in the species tree $S$. For any internal node $s \in M(G)$, we use $M^{-1}(s)$ to denote the
set of all internal nodes $g$ of the gene tree that are mapped to $s$ under $M$. For any node
$x$ and a descendant $y$ of $x$ in the gene tree $G$, if $M(x) = M(y) = s$, then $M(g) = s$ for
each node $g$ on the path from $x$ to $y$. Since $G$ is uniquely leaf labeled, all internal
nodes in $M^{-1}(s)$ form a rooted subtree of $G$, denoted by $T^{-1}(s)$, as illustrated in
Figure 2.

$T^{-1}(s)$ is not a full binary tree in general. In particular, its root might have degree 1.
Let $n'_s$, $n''_s$, $n'''_s$ denote the number of non-root degree-1, degree-2 and degree-3
nodes in the subtree $T^{-1}(s)$, respectively. Assume that $T^{-1}(s)$ has two or more nodes.
Then, by definition, the root of $T^{-1}(s)$ corresponds with a gene duplication in the
reconciliation of $G$ and $S$; each degree-2 or degree-3 node of $T^{-1}(s)$ also corresponds
with a gene duplication. Therefore, there are $n''_s + n'''_s + 1$ duplication events at $s$.
We now consider two cases.

Case 1. The root of $T^{-1}(s)$ has degree 1. Then $T^{-1}(s)$ has $n'''_s + 1$ leaves, that is,
$n'_s = n'''_s + 1$. Each leaf of $T^{-1}(s)$ has two children that are mapped to a node below
$s$ in the species tree $S$; each non-root degree-2 node has exactly one child that is mapped
to a node below $s$, and so does the root since it has degree 1. Thus, there are
$2(n'''_s + 1) + n''_s + 1$ image paths that contain one of the two lineages from $s$ to one of
its children.

Case 2. The root has degree 2. In this case, $T^{-1}(s)$ has $n'''_s + 2$ leaves and there are
$2(n'''_s + 2) + n''_s$ image paths that contain one of the two lineages from $s$ to one of its
children.



Figure 2: (i) A gene tree, (ii) A species tree. In the lca reconciliation $M$ of the gene tree in
the species tree, a is mapped to the green node, b, c, d, e, f and r to the red node, and g to
the purple node. The nodes b, c, d, e, f, r form a subtree of the gene tree.

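By distributing the DC and duplication costs to each image node $s$ in $M(G)$, we obtain
that

\begin{align*}
c_{dc}(G, S) &\ge \sum_{s \in M(G):\, |T^{-1}(s)| > 1} (\text{the no. of extra gene lineages on the branches leaving } s) \tag{3} \\
&\ge \sum_{s \in M(G):\, |T^{-1}(s)| > 1} \left( 2n'''_s + n''_s + 1 \right) \\
&\ge \sum_{s \in M(G):\, |T^{-1}(s)| > 1} \left( n'''_s + n''_s + 1 \right) \\
&= c_{dup}(G, S). \tag{4}
\end{align*}

This finishes the proof. □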

Remark. The inequality $c_{dc}(G, S) \ge c_{dup}(G, S)$ holds even for arbitrary gene trees in
which two or more leaves have the same label, representing genes sampled from the same species.
In the general case, $T^{-1}(s)$ might be a forest, that is, a union of rooted trees. However,
the estimation in the proof is still valid if the sum is taken over all the subtrees that are
mapped to a node in the species tree, i.e. $T^{-1}(s)$ is replaced by the subtrees of each
resulting forest.

4 The NP-hardness of the species tree problem under the DC cost

The parsimony criterion is often used for inference in biology. Hence, inferring the species
tree from a set of gene trees is formulated as the following algorithmic problem.

Species Tree Problem 

Input: A set of gene trees $G_i$, $1 \le i \le n$.
Solution: A species tree $S$ that minimizes the total cost $\sum_i c(G_i, S)$, where
$c(\cdot, \cdot)$ is a cost function.
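For very small label sets, the Species Tree Problem under the DC cost can be solved by exhaustive search, which makes the problem statement concrete. The sketch below is a brute-force illustration only (the number of candidate trees grows super-exponentially) and assumes that all gene trees are uniquely leaf labeled over the same label set; trees are nested 2-tuples with label strings as leaves, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def species_clusters(S):
    """List the cluster (set of leaf labels) of every node of S, root last."""
    if isinstance(S, str):
        return [frozenset([S])]
    left, right = species_clusters(S[0]), species_clusters(S[1])
    return left + right + [left[-1] | right[-1]]


def deep_coalescence_cost(G, S):
    """Sum of (k - 1) over branches of S lying on k image paths M(g) -> M(child)."""
    clusters = species_clusters(S)
    on_branch = Counter()

    def image(g):
        if isinstance(g, str):
            return frozenset([g])
        a, b = image(g[0]), image(g[1])
        m = min((c for c in clusters if (a | b) <= c), key=len)   # M(g)
        for child in (a, b):
            for h in clusters:
                if child <= h < m:
                    on_branch[h] += 1
        return m

    image(G)
    return sum(k - 1 for k in on_branch.values())


def all_trees(labels):
    """All rooted full binary trees on a tuple of distinct labels."""
    if len(labels) == 1:
        return [labels[0]]
    first, rest = labels[0], labels[1:]
    trees = []
    for k in range(len(rest)):
        for group in combinations(rest, k):
            right = tuple(x for x in rest if x not in group)
            for lt in all_trees((first,) + group):
                for rt in all_trees(right):
                    trees.append((lt, rt))
    return trees


def best_species_tree(gene_trees, labels):
    """Exhaustive search: return (species tree, total DC cost) with minimum total cost."""
    candidates = all_trees(tuple(sorted(labels)))
    scored = ((S, sum(deep_coalescence_cost(G, S) for G in gene_trees)) for S in candidates)
    return min(scored, key=lambda pair: pair[1])


gene_trees = [
    (("a", "b"), ("c", "d")),
    (("a", "b"), ("c", "d")),
    ((("a", "b"), "c"), "d"),
]
best, cost = best_species_tree(gene_trees, ["a", "b", "c", "d"])
print(best, cost)   # (('a', 'b'), ('c', 'd')) 1
```

In the example, the balanced tree agrees with two of the three gene trees and costs one extra lineage on the third, while every other candidate costs at least one extra lineage on each of the two balanced gene trees.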

It is proved that the species tree problem is NP-hard for the duplication and/or loss cost 
in [19] . In this section, we prove the following theorem. 

Theorem 4.1 The species tree problem is NP-hard under the DC cost. 

Proof. Given a gene tree $G$ and a species tree $S$, the DC cost $c_{dc}(G, S)$ can be computed
in polynomial time since gene duplications and losses can be counted in linear time [34].
Therefore, the decision version of the species tree problem is in NP.

To prove its NP-hardness, we reduce the Maximum Cut problem to the decision version
of the species tree problem. Given an instance graph $Q = (V, E)$ and a positive integer $l$,
the Maximum Cut problem is to decide whether the node set $V$ can be partitioned into two
disjoint subsets $V_1$ and $V_2$ such that there are at least $l$ edges from $E$ that have one
endpoint in $V_1$ and one endpoint in $V_2$. Assume that $V = \{v_1, v_2, \ldots, v_n\}$ and
that $|E|$ denotes the number of edges in $E$, where $n \ge 3$. We construct a corresponding
instance of the species tree problem as follows.
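To make the source problem of the reduction concrete, the decision version of Maximum Cut can be answered on tiny graphs by trying every bipartition. This is an illustrative brute-force sketch with hypothetical function names; it takes time exponential in the number of nodes, in line with the hardness of the problem.

```python
from itertools import product


def max_cut(nodes, edges):
    """Return (size, side) for a largest cut; side maps each node to 0 (V1) or 1 (V2)."""
    best_size, best_side = -1, None
    for bits in product([0, 1], repeat=len(nodes)):
        side = dict(zip(nodes, bits))
        size = sum(1 for u, v in edges if side[u] != side[v])   # edges crossing the cut
        if size > best_size:
            best_size, best_side = size, side
    return best_size, best_side


def has_cut_of_size(nodes, edges, l):
    """Decision version: is there a partition with at least l cut edges?"""
    return max_cut(nodes, edges)[0] >= l


triangle = [(1, 2), (2, 3), (1, 3)]
print(max_cut([1, 2, 3], triangle)[0])          # 2
print(has_cut_of_size([1, 2, 3], triangle, 3))  # False
```

A triangle has no cut of three edges because any bipartition places two of its nodes on the same side.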




Figure 3: Gene trees defined for each edge $e = (v_i, v_j)$.

Choose $N > n^2$ and $M > n^2 N(N + 1) + |E|$. For each node $v_i$ ($1 \le i \le n$), we
introduce a label with the same name $v_i$. We also introduce $2N + M$ extra labels
$x_i, y_i$, $1 \le i \le N$, and $z_j$, $1 \le j \le M$. For each edge $e = (v_i, v_j) \in E$,
we define two gene trees $T_{e1}$ and $T_{e2}$ as shown in Figure 3. These two trees are the
same except that the leaf labels $v_i$ and $v_j$ are swapped.

Figure 4: 'Structural' gene trees.

Let the trees shown in Figure 4 (i)-(iii) be written as $L[x_i, x_j, y_k, z_l]$,
$L[y_i, y_j, x_k, z_l]$ and $F[\{x_i\}, \{y_i\}, z_l]$, respectively. Besides the 'edge' gene
trees $T_{e1}$ and $T_{e2}$ ($e \in E$), the set $A$ of gene trees in the instance of the
problem to be defined also contains

\begin{align*}
G(i,j,k,m) &= L[x_i, x_j, y_k, z_m], & 1 \le i < j \le N,\ 1 \le k \le N,\ 1 \le m \le M, \\
G'(i,j,k,m) &= L[y_i, y_j, x_k, z_m], & 1 \le i < j \le N,\ 1 \le k \le N,\ 1 \le m \le M, \\
G''_m &= F[\{x_i\}, \{y_i\}, z_m], & 1 \le m \le M.
\end{align*}

These three classes of gene trees are introduced to restrict the topology of the optimal species
tree for the defined instance of the problem. Hence, we call them 'structural' gene trees. The
NP-completeness of the decision version of the species tree problem follows from the following
two lemmas.

Lemma 4.1 If the graph $Q$ has a cut of $d$ edges, there is a species tree $S_g$ having the DC
cost

\[
c_{dc}(A, S_g) = N(N + 1)|E| + |E| - d.
\]

Proof. Assume that the node set $V$ of the graph $Q$ is divided into
$V_1 = \{v_1, v_2, \ldots, v_p\}$ and $V_2 = \{v_{p+1}, v_{p+2}, \ldots, v_n\}$ such that there
are exactly $d$ edges having one endpoint in $V_1$ and one endpoint in $V_2$. We define a
species tree $S_g$ as shown in Figure 5. First, we observe that

\[
c_{dc}(G(i,j,k,m), S_g) = 0, \quad c_{dc}(G'(i,j,k,m), S_g) = 0, \quad c_{dc}(G''_m, S_g) = 0,
\]

for each possible $i, j, k, m$.

Figure 5: Species tree $S_g$ defined in Lemma 4.1.

Consider a non-cut edge $e = (v_i, v_j)$ ($i < j$). Since
$L(T_{e1}) = L(T_{e2}) \subset L(S_g)$, we have
$c_{dc}(T_{e1}, S_g) = c_{dc}(T_{e1}, S_g|_{L(T_{e1})})$ and
$c_{dc}(T_{e2}, S_g) = c_{dc}(T_{e2}, S_g|_{L(T_{e2})})$. If $v_i, v_j \in V_1$, we have that

\[
c_{dc}(T_{e1}, S_g) = \tfrac{1}{2}N(N - 1) + N + 1, \quad c_{dc}(T_{e2}, S_g) = \tfrac{1}{2}N(N - 1) + N.
\]

Symmetrically, if $v_i, v_j \in V_2$, we have that

\[
c_{dc}(T_{e1}, S_g) = \tfrac{1}{2}N(N - 1) + N, \quad c_{dc}(T_{e2}, S_g) = \tfrac{1}{2}N(N - 1) + N + 1.
\]

In both cases the two costs sum to $N(N + 1) + 1$. For each cut edge $e = (v_i, v_j)$
($i < j$) with one endpoint in $V_1$, say $v_i \in V_1$, and another in $V_2$, we have that

\[
c_{dc}(T_{e1}, S_g) = 0, \quad c_{dc}(T_{e2}, S_g) = N(N + 1). \tag{5}
\]

Therefore, summing over the $|E| - d$ non-cut edges and the $d$ cut edges, we have

\[
c_{dc}(A, S_g) = N(N + 1)|E| + |E| - d.
\]

This finishes the proof of the lemma. □

Lemma 4.2 If there is a species tree $S$ having the DC cost
$c_{dc}(A, S) = N(N + 1)|E| + t$, then the graph $Q$ has a cut of at least $|E| - t$ edges.

Proof. If $t \ge |E|$, the fact is trivial. Hence, without loss of generality, we may assume
that $t < |E|$. Here, we use $\mathrm{LTree}[a, \ldots, b, c]$ to denote the line tree with
leaves labeled by $a$, $\ldots$, $b$, $c$, respectively, as shown in Figure 6 (i). Note that the
leaf $a$ is a child of the root in $\mathrm{LTree}[a, \ldots, b, c]$. For a set of trees
$T', \ldots, T'', T'''$, we use $\mathrm{LTree}[T', \ldots, T'', T''']$ to denote the tree
obtained by replacing each leaf by a corresponding subtree in $\mathrm{LTree}[a, \ldots, b, c]$,
as shown in Figure 6 (ii).

Figure 6: (i) Line tree $\mathrm{LTree}[a, \ldots, b, c]$. (ii) The resulting tree
$\mathrm{LTree}[T', \ldots, T'', T''']$ after replacing each leaf with a tree in a line tree.

Let $B$ be a subset of leaves in the species tree $S$ and let the least common ancestor of the
leaves from $B$ be $t_B$ in $S$. Recall that the homomorphic subtree $S|_B$ of $S$ induced by
$B$ is the tree obtained from $S$ by removing all the nodes and edges that are not on a path
from $t_B$ to a leaf from $B$ and then contracting all the degree-2 nodes except for the root
$t_B$. For example, for $S_g$ defined in Lemma 4.1,
$S_g|_{\{x_1, x_2, y_1\}} = \mathrm{LTree}[y_1, x_1, x_2]$.

Set

\[
U = \{x_1, x_2, \ldots, x_N\} \cup \{y_1, y_2, \ldots, y_N\} \cup \{v_1, v_2, \ldots, v_n\},
\qquad Z = \{z_1, z_2, \ldots, z_M\}. \tag{6}
\]

By replacing the children of a two-leaf rooted tree with $S|_U$ and $S|_Z$, we obtain a species
tree $S' = \mathrm{LTree}[S|_U, S|_Z]$ from $S$. First, $S'$ has the following property.

Fact 1. $c_{dc}(A, S') \le c_{dc}(A, S) = N(N + 1)|E| + t$.

Proof. For each gene tree $T = T_{e1}$ or $T_{e2}$, we use $f$ and $f'$ to denote the lca
mappings from $T$ to $S$ and $S'$, respectively. For each edge $e = (u_1, u_2)$ in the spanning
subtree over the $z_i$'s in $T$, by the definition of $S|_Z$, $f(u_1) = f(u_2)$ if and only if
$f'(u_1) = f'(u_2)$, and $d(f'(u_1), f'(u_2)) \le d(f(u_1), f(u_2))$. For each edge in the
spanning subtree over the $x_i$'s, $v_i$, $v_j$ and the $y_i$'s, the same property holds.
However, the edges incident to the root of $T$ may not satisfy the property discussed above.
Let $r$ be the root of $T$. Assume that $a(r)$ is the left child of $r$, which is the least
common ancestor of the $x_i$'s and $y_i$'s, and $b(r)$ the right child of $r$. It is possible
that $f(r) = f(a(r))$ and/or $f(r) = f(b(r))$. However, $f'(r) = r'$, $f'(a(r)) = a(r')$ and
$f'(b(r)) = b(r')$, where $r'$ is the root of $S'$, and $a(r')$ and $b(r')$ are the roots of
$S|_U$ and $S|_Z$ respectively. Since no other lineages fail to coalesce with $(r, a(r))$ on
$(r', a(r'))$ and with $(r, b(r))$ on $(r', b(r'))$ respectively, these two edges do not affect
the deep coalescence cost. Thus, $c_{dc}(T, S') \le c_{dc}(T, S)$.

Similarly, we also have the following three inequalities

\begin{align*}
c_{dc}(G(i,j,k,m), S') &\le c_{dc}(G(i,j,k,m), S), \\
c_{dc}(G'(i,j,k,m), S') &\le c_{dc}(G'(i,j,k,m), S), \\
c_{dc}(G''_m, S') &\le c_{dc}(G''_m, S)
\end{align*}

for any $i, j, k, m$. Thus, the fact holds. □

Fact 2. In $S|_U$, all the leaves $x_i$ must be below one child of the root and all the leaves
$y_i$ must be below the other child of the root. In other words,
$S|_U = \mathrm{LTree}[T_1, T_2]$, where $T_1$ is a tree over the $x_i$'s and some $v_j$'s and
$T_2$ is a tree over the $y_i$'s and some $v_j$'s.

Proof. Assume that the fact is false. Then there are $x_i, x_j$ and $y_k$ such that
$S|_{\{x_i, x_j, y_k\}} = (S|_U)|_{\{x_i, x_j, y_k\}} = \mathrm{LTree}[x_i, x_j, y_k]$, or there
are $y_i, y_j$ and $x_k$ such that
$S|_{\{y_i, y_j, x_k\}} = (S|_U)|_{\{y_i, y_j, x_k\}} = \mathrm{LTree}[y_i, y_j, x_k]$. If the
former is true, then

\[
c_{dc}(G(i,j,k,m), S') \ge 1, \quad 1 \le m \le M.
\]

This implies that

\[
N(N + 1)|E| + t \ge c_{dc}(A, S') \ge \sum_{m=1}^{M} c_{dc}(G(i,j,k,m), S') \ge M,
\]

contradicting the fact that $M > N(N + 1)n^2$. If the latter is true, then
$c_{dc}(G'(i,j,k,m), S') \ge 1$ for any $1 \le m \le M$. Again, we have that
$c_{dc}(A, S') \ge M$, leading to a contradiction. □

Let $X = \{x_1, x_2, \ldots, x_N\}$ and $Y = \{y_1, y_2, \ldots, y_N\}$. Then
$S'|_X = (S|_U)|_X$ and $S'|_Y = (S|_U)|_Y$.

Fact 3. $S'|_X = \mathrm{LTree}[x_1, x_2, \ldots, x_N]$ and
$S'|_Y = \mathrm{LTree}[y_1, y_2, \ldots, y_N]$.

Proof. Note that $G''_m|_X = \mathrm{LTree}[x_1, x_2, \ldots, x_N]$ and
$G''_m|_Y = \mathrm{LTree}[y_1, y_2, \ldots, y_N]$ for any $1 \le m \le M$. If the claim is
false, then $c_{dc}(G''_m, S') \ge 1$ for any $m$ and hence

\[
N(N + 1)|E| + t \ge c_{dc}(A, S') \ge \sum_{m=1}^{M} c_{dc}(G''_m, S') \ge M,
\]

a contradiction as in the proof of Fact 2. □

Let the least common ancestor of the $x_i$'s and $y_i$'s be $r$ in $S'$. We have shown that the
$x_i$'s are below one child of $r$, say $r_1$, and the $y_i$'s are below the other child of $r$,
say $r_2$. In addition, $S'|_X$ and $S'|_Y$ are two line trees.

Fact 4. For each edge $e = (v_i, v_j)$ ($i < j$) such that $v_i$ and $v_j$ are in the same
subtree as the $x_i$'s or as the $y_i$'s,

\[
c_{dc}(T_{e1}, S') + c_{dc}(T_{e2}, S') \ge N(N + 1) + 1.
\]

Proof. Without loss of generality, we may assume that $v_i$ and $v_j$ are below $r_1$ in the
same subtree as the $x_i$'s. We consider the following cases.

Case 1.
$S'|_{X \cup \{v_i, v_j\}} = \mathrm{LTree}[x_1, x_2, \ldots, x_k, v_i, x_{k+1}, \ldots, x_m, v_j, x_{m+1}, \ldots, x_N]$
for some $0 \le k \le m \le N$. In this case, we have that

\[
c_{dc}(T_{e1}, S') = \tfrac{1}{2}N(N - 1) + N + 1 + \tfrac{1}{2}(N - k)(N - k - 1)
\]

and

\[
c_{dc}(T_{e2}, S') = \tfrac{1}{2}N(N - 1) + k + 1 + \tfrac{1}{2}(N - m)(N - m - 1).
\]

Hence,

\begin{align*}
c_{dc}(T_{e1}, S') + c_{dc}(T_{e2}, S')
&\ge N(N - 1) + N + 1 + \tfrac{1}{2}\left[ (N - k)(N - k - 1) + 2k + 2 \right] \\
&\ge N(N + 1) + 1
\end{align*}

as the minimum value of $(N - k)(N - k - 1) + 2k + 2$ is $2N$ (attained at $k = N - 2, N - 1$).

Case 2.
$S'|_{X \cup \{v_i, v_j\}} = \mathrm{LTree}[x_1, x_2, \ldots, \mathrm{LTree}[v_i, v_j], \ldots, x_{N-1}, x_N]$
for some $0 \le k \le N$. We have that

\[
c_{dc}(T_{e1}, S') = c_{dc}(T_{e2}, S') = \tfrac{1}{2}N(N - 1) + k + 2 + \tfrac{1}{2}(N - k)(N - k - 1)
\]

and so

\begin{align*}
c_{dc}(T_{e1}, S') + c_{dc}(T_{e2}, S')
&\ge N(N - 1) + 2k + 4 + (N - k)(N - k - 1) \\
&\ge N(N + 1) + 2
\end{align*}

as the minimum value of $2k + (N - k)(N - k - 1)$ is $2N - 2$ (attained at $k = N - 1, N - 2$).
The fact is proved. □

Fact 5. For each edge $e = (v_i, v_j)$ such that $v_i$ is below $r_1$ in the same subtree as
the $x_i$'s and $v_j$ is below $r_2$ in the same subtree as the $y_i$'s,

\[
c_{dc}(T_{e1}, S') + c_{dc}(T_{e2}, S') \ge N(N + 1).
\]

Proof. Let

\[
S'|_{X \cup \{v_i\}} = \mathrm{LTree}[x_1, x_2, \ldots, x_k, v_i, x_{k+1}, \ldots, x_{N-1}, x_N]
\]

and

\[
S'|_{Y \cup \{v_j\}} = \mathrm{LTree}[y_1, y_2, \ldots, y_m, v_j, y_{m+1}, \ldots, y_{N-1}, y_N].
\]

We have that all the internal nodes in $T_{e2}$ are mapped onto the least common ancestor $r$
of the $x_i$'s and $y_i$'s and thus

\[
c_{dc}(T_{e2}, S') = N(N + 1).
\]

Since $c_{dc}(T_{e1}, S') \ge 0$, the fact is proved. □

Let $V_1$ denote the subset of leaves $v_i$ below $r_1$ in the same subtree as the $x_i$'s and
$V_2$ the subset of leaves $v_j$ below $r_2$ in the same subtree as the $y_i$'s. Then
$(V_1, V_2)$ is a cut of the graph $Q$. Assume there are $p$ cut edges. Since there are
$|E| - p$ non-cut edges, Facts 1, 4 and 5 give

\begin{align*}
N(N + 1)|E| + t &\ge c_{dc}(A, S') \\
&\ge (|E| - p)N(N + 1) + pN(N + 1) + (|E| - p) \\
&= N(N + 1)|E| + |E| - p,
\end{align*}

which implies that $p \ge |E| - t$. This finishes the proof of Lemma 4.2. □



5 Conclusion 

We conclude this paper by posing two related research problems. In this paper, we have
proved that species tree inference by minimizing deep coalescences is NP-hard. This justifies
the effort from different groups in seeking efficient heuristic methods for the inference problem
[21], [32]. We have also discussed the relationship between the deep coalescence cost and the
gene duplication cost. Is there any polynomial-time algorithm with a constant approximation
ratio for the species tree problem in the deep coalescence model? Note that the heuristic method
developed by Than and Nakhleh in [32] seems to be effective.

In [9], Stege studied the parametric complexity of species tree inference by minimizing
gene duplications. Is it possible to develop an efficient algorithm for parametric species tree
inference under the deep coalescence model?